Precise Zero-Shot Dense Retrieval without Relevance Labels

it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available. (Abstract)

#HyDE (Hypothetical Document Embeddings)

Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document.

Then, an unsupervised contrastively learned encoder (e.g. Contriever) encodes the document into an embedding vector.

(Contriever -> Unsupervised Dense Information Retrieval with Contrastive Learning; from Section 1)
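The two steps quoted above (generate a hypothetical document, then embed it and search with that embedding) can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `generate_hypothetical_document` is a hypothetical stub with a canned answer standing in for an instruction-following LLM such as InstructGPT, and `encode` is a normalized bag-of-words vector standing in for a dense encoder such as Contriever.

```python
import math
import re
from collections import Counter


def encode(text):
    """Toy stand-in for a dense encoder (assumption): L2-normalized bag of words."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def similarity(a, b):
    """Inner product of two sparse vectors (equals cosine, since both are normalized)."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


def generate_hypothetical_document(query):
    """Hypothetical stub for the LLM call; a real system would prompt a model here."""
    canned = {
        "what is dense retrieval": (
            "Dense retrieval encodes queries and documents into embedding "
            "vectors and ranks documents by vector similarity."
        ),
    }
    # Fall back to the raw query when no canned answer exists.
    return canned.get(query, query)


def hyde_search(query, corpus, top_k=1):
    """Search with the embedding of the hypothetical document, not of the query."""
    hypothetical_doc = generate_hypothetical_document(query)
    search_vector = encode(hypothetical_doc)
    scored = [(similarity(search_vector, encode(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


corpus = [
    "Dense retrieval ranks documents by the similarity of embedding vectors.",
    "BM25 is a lexical ranking function based on term frequency.",
    "Swahili, Korean, and Japanese are covered by Mr. TyDi.",
]

print(hyde_search("what is dense retrieval", corpus))
```

Note that the search vector comes from document-like text, so the comparison is document-to-document rather than query-to-document.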

Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers, across various tasks (e.g. web search, QA, fact verification) and languages (e.g. sw, ko, ja).

A ja (Japanese) dataset?

https://github.com/texttron/hyde

https://github.com/texttron/hyde/blob/main/approach.png?raw=true

4.1 Setup (4 Experiments)

Datasets

web search query sets (-> 4.2, Table 1)

TREC DL19 (Overview of the TREC 2019 deep learning track)

TREC DL20 (Overview of the TREC 2020 deep learning track)

Both are based on MS MARCO

diverse collection of 6 low-resource datasets (-> 4.3, Table 2)

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

non-English retrieval: Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval

Included in LangChain: https://twitter.com/LangChainAI/status/1605962865449598979

When using embedding-based search, might accuracy improve if the search targets and the query are brought into the same format?